speech-to-speech translation
- Europe > Austria > Vienna (0.14)
- Asia > South Korea > Incheon > Incheon (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- (12 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (3 more...)
RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Zheng, Zhisheng, Sun, Xiaohang, Dinh, Tuan, Yanamandra, Abhishek, Jain, Abhinav, Liu, Zhu, Hadap, Sunil, Bhat, Vimal, Aggarwal, Manoj, Medioni, Gerard, Harwath, David
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
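As a rough illustration of the text-bridge idea in this abstract, the sketch below shows how S2ST training supervision could be derived from purely monolingual (speech, transcript) pairs. The `nmt_translate` and `text_to_units` callables are hypothetical placeholders, not the authors' API, and the data layout is an assumption.

```python
# Hypothetical sketch of text-bridged training data construction;
# no parallel speech is assumed, matching the recipe in the abstract.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class S2STExample:
    src_speech: bytes          # source-language audio (any encoding)
    src_text: str              # its monolingual transcript
    tgt_text: str              # text bridge, produced by an NMT model
    tgt_units: List[int]       # discrete target-speech training targets

def build_example(src_speech: bytes, src_text: str,
                  nmt_translate: Callable[[str], str],
                  text_to_units: Callable[[str], List[int]]) -> S2STExample:
    """Derive S2ST supervision from one monolingual (speech, text) pair.

    The NMT model supplies the target text (a bridge used only during
    training); a tokenizer maps it to discrete speech units. At inference
    the trained model maps speech to speech directly, with no text step.
    """
    tgt_text = nmt_translate(src_text)
    return S2STExample(src_speech, src_text, tgt_text, text_to_units(tgt_text))
```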
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Rashidi, Sina, Sameti, Hossein
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative-position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR-BLEU over direct baselines when trained with the synthetic data. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
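The synthetic-data pipeline described in this abstract is simple enough to sketch. Below is a minimal, hypothetical version in which `llm_translate` and `zero_shot_tts` stand in for the unnamed LLM and zero-shot TTS systems; the corpus representation is an assumption.

```python
from typing import Callable, Iterable, List, Tuple

def synthesize_parallel_corpus(
    persian_corpus: Iterable[Tuple[bytes, str]],   # (speech, transcript) pairs
    llm_translate: Callable[[str], str],           # fa text -> en text
    zero_shot_tts: Callable[[str], bytes],         # en text -> en speech
) -> List[Tuple[bytes, bytes]]:
    """Expand monolingual Persian data into synthetic fa->en parallel speech.

    Each Persian transcript is translated by an LLM, and the English side
    is synthesized with zero-shot TTS, yielding (fa speech, en speech)
    pairs suitable for direct S2ST training.
    """
    parallel = []
    for fa_speech, fa_text in persian_corpus:
        en_text = llm_translate(fa_text)
        parallel.append((fa_speech, zero_shot_tts(en_text)))
    return parallel
```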
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Spain (0.04)
- Europe > Austria > Styria > Graz (0.04)
- Asia (0.04)
- Materials > Metals & Mining (0.46)
- Government (0.46)
StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
Chen, Xi, Song, Yuchen, Nakamura, Satoshi
EmphST-Bench. To guide algorithm exploration and evaluate the performance of our model, we design an evaluation pipeline for emphasis-preserving speech-to-speech translation. Given the lack of ready-to-use benchmarks for this important task, we leverage LLMs to translate the test set of the StressTest [21] corpus into the target language and then filter the results via human experts. This process creates a high-quality benchmark dataset, EmphST-Bench, with manually verified emphasis alignments between source and target utterances, ensuring reliable assessment of cross-lingual emphasis preservation. The human filtering step focuses on correcting discrepancies in semantic equivalence, contrastive focus, and emotional intensity, resulting in a robust evaluation set that closely mirrors real-world linguistic nuances. EmphST-Bench consists of carefully selected parallel samples from English (source) to Chinese (target), providing a standardized resource for evaluating stress-aware S2ST systems. We report the statistics of EmphST-Bench in Table 1; the benchmark comprises 218 samples.
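A hedged sketch of that construction loop follows; `translate_with_emphasis` and `expert_approves` are placeholder names for the LLM translation step and the human-filtering step, neither of which is specified in this excerpt.

```python
from typing import Callable, List, Tuple

def build_emphst_bench(
    stresstest_samples: List[str],                    # English source utterances
    translate_with_emphasis: Callable[[str], str],    # LLM en -> zh translation
    expert_approves: Callable[[str, str], bool],      # human verification
) -> List[Tuple[str, str]]:
    """Assemble emphasis-aligned (en, zh) pairs for a benchmark like EmphST-Bench.

    Experts reject pairs with discrepancies in semantic equivalence,
    contrastive focus, or emotional intensity, as the excerpt describes.
    """
    return [
        (en, zh)
        for en in stresstest_samples
        for zh in [translate_with_emphasis(en)]
        if expert_approves(en, zh)
    ]
```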
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)
MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
Wang, Jianjin, Zhao, Runsong, Liu, Xiaoqian, Ge, Yuan, Xu, Ziqiang, Xiao, Tong, Gao, Shengxiang, Yu, Zhengtao, Zhu, Jingbo
Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is semantically sparse, so multiple tokens are generally needed to express a complete semantic unit. To address this limitation, we introduce a multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling them to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and increasing the information density per position. Initial MTP implementations apply the loss at the final layer, which improves the output representation but begins information enrichment too late. We hypothesize that moving this enrichment to intermediate layers yields earlier and more effective enhancement of the hidden representations. Consequently, we propose the MTP-S2UT loss, which applies the MTP loss to the hidden representations where the CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
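To make the MTP idea concrete, here is a minimal PyTorch sketch of a multi-token prediction loss with one projection head per future offset. Feeding it final-layer states gives plain MTP, while feeding it the intermediate hidden states where the CTC loss is computed corresponds to the MTP-S2UT variant; the per-offset head layout is an assumption, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden: torch.Tensor,       # (B, T, D) hidden states
             heads: nn.ModuleList,       # one nn.Linear(D, vocab) per offset
             units: torch.Tensor) -> torch.Tensor:  # (B, T) target unit ids
    """Average cross-entropy over predictions of the next k tokens.

    Position t is trained to predict units[t + k] through heads[k - 1],
    so each position supervises several subsequent tokens at once.
    """
    total, n_terms = hidden.new_zeros(()), 0
    T = hidden.size(1)
    for k, head in enumerate(heads, start=1):
        if T <= k:
            break
        logits = head(hidden[:, : T - k])            # predict token at t + k
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), units[:, k:].reshape(-1)
        )
        n_terms += 1
    return total / max(n_terms, 1)
```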
- Asia > China > Yunnan Province > Kunming (0.04)
- Asia > China > Liaoning Province > Shenyang (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (7 more...)
Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
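For intuition, a toy version of monotonic embedding alignment is sketched below: a plain match-or-skip dynamic program over cosine distances. Speech Vecalign itself, like the Vecalign algorithm it builds on, additionally handles many-to-many merges and coarse-to-fine search, so this is only a simplified stand-in.

```python
import numpy as np

def monotonic_alignment_cost(src: np.ndarray, tgt: np.ndarray,
                             skip_cost: float = 0.5) -> float:
    """Cost of the best monotonic 1-1 alignment of two embedding sequences.

    src: (n, d) source speech-segment embeddings
    tgt: (m, d) target speech-segment embeddings
    Matching segment i to j costs their cosine distance; either side may
    be skipped for a fixed penalty. Backtrace is omitted for brevity.
    """
    def cos_dist(a, b):
        return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    n, m = len(src), len(tgt)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                D[i, j] = min(D[i, j], D[i-1, j-1] + cos_dist(src[i-1], tgt[j-1]))
            if i:
                D[i, j] = min(D[i, j], D[i-1, j] + skip_cost)  # drop a source segment
            if j:
                D[i, j] = min(D[i, j], D[i, j-1] + skip_cost)  # drop a target segment
    return float(D[n, m])
```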
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- (16 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.47)